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We propose an automatic Bayesian approach to the selection of covariates 
and their penalised splines transformations in generalised additive models. 
Specification of a default, hyper-^ prior for the model parameters and a 
multiplicity-correction prior for the models themselves is crucial for this 
task. We introduce the methodology in the normal model and extend it 
to non-normal exponential families. Two applications from the literature 
illustrate the proposed approach. An efficient implementation is available 
in an R-package. 

Keywords: penalised splines, Bayesian variable selection, ^-prior, shrinkage. 

1. Introduction 

Semiparametric regression has achieved an impressive dissemination over the last 
years. Its central idea is to replace parametric regression functions by smooth, semi- 
parametric components. Following Hastie and Tibshirani (1990), suppose we have p 
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continuous covariates Xi,...,Xp and use the additive model 

P 

y = ^o + X]'^;(^;)+e. (1) 

where mj are smooth but otherwise unspecified functions and e ~ N(0, ct"^). For iden- 
tifiability purposes we further assume that E(my(Xy)) = with respect to the marginal 
distribution of covariate Xj. Estimation of the smooth terms in (1) can be carried out in 
different ways, where we here make use of penalised splines, see e. g. Filers and Marx 
(2010) or Wood (2006). A general introduction to penalised spline smoothing has 
been provided by Ruppert, Wand, and Carroll (2003) and the approach has become a 
popular smoothing technique since then, see Ruppert, Wand, and Carroll (2009). The 
general idea is to decompose the function nij into a linear and a nonlinear part, where 
the latter is represented through a spline basis, that is 

mj{xj) = Xj^j + Zj{xjfuj. (2) 

Here Zj{xj) is a X x 1 spline basis vector at position Xj and Uj is the corresponding 
coefficient vector. Conveniently one may choose a truncated polynomial basis for 
Zj[-) but representation (2) holds in general as well, see Wand and Ormerod (2008). 
To achieve a smooth fit one imposes a (quadratic) penalty on the spline coefficient 
vector Uj which is formulated as the normal prior 

Uj ^NK{0K,O-'^PjlK), (3) 

where 0^ is the all-zeros vector and 1^ is the identity matrix of dimension K. Here the 
variance factor pj steers the amount of penalisation (relative to the regression variance 
a^). A larger pj leads to a higher prior variance of the spline coefficients and hence 
a more wiggly function nij, while a smaller pj leads to a stronger penalty on ||My|| 
and thus a smoother function nij. Setting pj to zero imposes Uj = 0^ so that mj{xj) 
collapses to a linear term mj{xj) = Xjjij. Hence the role of pj (; = 1, . . . , p) extends to 
the selection of (generalised) additive models, which will be the focus of this paper. 
Variable selection will be treated by allowing the alternative nij{xj) = 0. 

Variable selection in generalised additive models is important to reduce the vari- 
ance of effect estimates due to uninformative covariates. The field is wide and many 
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different approaches have been proposed in the last years. Friedman (2001) and 
Tutz and Binder (2006) describe boosting algorithms, which are extended by Kneib, Hothorn, and Tutz 
(2009) to geoadditive regression models (Fahrmeir, Kneib, and Lang, 2004). For the 
same model class, Belitz and Lang (2008) propose to use information-criteria or cross- 
validation, while Fahrmeir, Kneib, and Konrath (2010) and Panagiotelis and Smith (2008) 
use spike-and-slab priors for variable and function selection. Brezger and Lang (2008) 
adopt the concept of Bayesian contour probabilities (Held, 2004) to decide on the in- 
clusion and form of covariate effects. Cottet, Kohn, and Nott (2008) generalise earlier 
work by Yau, Kohn, and Wood (2003) to Bayesian double-exponential regression mod- 
els, which comprise generalised additive models as a special case. Shrinkage ap- 
proaches are proposed by Wood (2011) and Marra and Wood (2011). Zhang and Lin 
(2006) use a lasso-type penalised likelihood approach, and Ravikumar, Liu, Lafferty and Wasserman 
(2008) and Meier, van de Geer, and Biihlmann (2009) use penalties favouring both sparsity 
and smoothness of high-dimensional models. Likelihood-ratio testing methods are de- 
scribed by Kauermann and Tutz (2001) and Cantoni and Hastie (2002). This list mir- 
rors the multitude as well as the variety of the different approaches and the enumera- 
tion is, of course, in no way exhaustive. 

In this paper we propose a novel Bayesian variable and function selection approach 
based on (generalised) hyper-^ priors. This type of prior for the parameters in the gen- 
eralised additive model traces back to the g-prior in the linear model (Zellner, 1986). 
Its hyper-parameter g acts as an inverse relative prior sample size, and assigning it a 
hyper-prior can solve the information paradox (Liang, Paulo, Molina, Clyde, and Berger, 
2008, section 4.1) of the fixed-^ case (Berger and Pericchi, 2001, p. 148) in the linear 
model. We will subsequently refer to such mixtures of g-priors generically as hyper-g 
priors. One specific example is the hyper-g prior of Liang et al. (2008, section 3.2), 
which enjoys a closed form for the marginal likelihood and leads to consistent model 
selection and model-averaged prediction. We follow the conventional prior approach 
(Berger and Pericchi, 2001, section 2.1) by using non-informative improper priors for 
parameters which are common to all models, and default proper hyper-g priors for 
model-specific parameters. 

The current work generalises the hyper-g prior for generalised linear models (Sabanes Bove and Held, 
2011a). In the same paper, we showed how fractional polynomials (Sabanes Bove and Held, 
2011b), which extend ordinary polynomials by square roots, reciprocals and the log- 
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arithm, can be used to model nonlinear covariate effects. However, fractional polyno- 
mials have the disadvantage of being not invariant to linear transformations of the co- 
variates. For variable and function selection, Fahrmeir et al. (2010) and Scheipl (2010) 
use a mixture of two inverse-gamma distributions with a very small ("spike") and a 
larger mean ("slab") as a hyper-prior for the variances of the regression coefficients' 
independent normal priors. The posterior probability for inclusion of a coefficient is 
then estimated from the proportion of Markov chain Monte Carlo (MCMC) variance 
samples in the "slab". While this prior structure eases the MCMC algorithm, it does 
not take into account the correlation structure of the covariates, and depends on the 
specification of the prior means in the two mixture components. Cottet et al. (2008) 
also use independent normal inverse-gamma priors for the regression coefficients, but 
they explicitly exclude coefficients from the model. For nonlinear effects they utilise 
low-rank approximations of smoothing splines, which require the choice of a threshold 
on the eigenvalue scale. 

The paper is organised as follows. We first apply the hyper-g^ prior of Liang et al. 
(2008) to additive models in Section 2. The methodology is extended to generalised 
additive models in Section 3. A multiplicity-correction prior on the model space and a 
stochastic search procedure are proposed in Section 4. We apply our approach to real 
data in Section 5 and suggest postprocessing techniques in Section 6. Section 7 closes 
the paper with a discussion. 



2. Hyper-^ Priors for Additive Models 

Assume we have observed independent responses at covariate values Xn, . . . , Xjp, 
i = 1, ...,n, from the additive normal model (1). For each covariate / = 1, 
we stack the covariate values into the n x 1 vector Xj = {xy, . . .,x„j)^ and the spline 
basis vectors into the n x K matrix Zj = {Zj{xijY , . . . ,Zj{xnjYY ■ The subsequent 
Gram-Schmidt process (see Bjorck, 1967) 

Xj = Xj In J — ^j '^n^jr (4) 

_ , _ llZj _ xJZj 

Zj — Zj — In—f- '^j^f ' 
1„ In ^j 
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where 1„ denotes the all-ones vector of dimension n, ensures that 1„, Xj and the 
columns of Zj are orthogonal to each other, i. e. l^Xj = and l^Zj = xJZj = 0^. 

A central measure of model complexity is the degrees of freedom. While in para- 
metric models this is just the number of parameters, for smoothing and mixed models 
Aerts, Claeskens, and Wand (2002, section 2.2) translate the variance factor pj into the 
corresponding degrees of freedom 

djipj) = tr{{ZjZj + prHy'zJZj} + 1 g (1, X + 1) (6) 

for a smoothly modelled covariate effect nij. Note that dj{pj) = Ylk=i ^i^l i'^jk + P^^) 
is easy to calculate via the (positive) eigenvalues Ayj. of ZjZj. This also shows that 
dj{pj) is strictly increasing with derivative Ayj./ {pj^jk + 1)^ > 0/ which implies 
that we may (numerically) invert the function to Pj{dj). In fact, by fixing the degrees 
of freedom dj for function mj{xj) we define the variance factor pj. Subsequently we 
will restrict the degrees of freedom to take values in a finite set V C {0} U [1, K + 1), 
say T> = {0,1,2,3, ... ,K}. For dj = we set nij {xj) = while for dj = 1 we have the 
linear model mj{xj) = Xjfij. In general, model (1) is indexed hy d = (rfi, . . . , dp) giving 
the degrees of freedom for each functional component and hence the structure of the 
model. 

After combining the I = I^jLi ^{^^j ^1) vectors Xj to the n x I linear design matrix 
= {xj '. dj > 1) and the / = jj^^i ^{dj > 1) matrices Zj to the n x JK spline design 
matrix = (Zy : > 1), and analogously constructing the respective coefficient vec- 
tors and u^, the conditional additive model for the response vector y = (yi, . . . , y„ ) ^ 
is 

y I ^0, ^d' Ud, (T^ ~ N„ {\nf>Q + Xd/S^ + ZdUd, (T^ln) ■ (7) 

Integrating out the the spline coefficient vector ~ Njx(Ojx,cr^D(j), where is 
block-diagonal with / blocks Pj^K (dj > 1), yields the marginal model 

y I /3o, )8^, cr^ ~ N„ (l„/3o + X^jS^, cr^Va) (8) 

with Vd = In + Z^D^Z^. This general linear model can be decorrelated into a stand- 
ard linear model by using the Cholesky decomposition = ^^^^^]/^- For the trans- 
formed response vector y = V^^^^y we have 

y I /3o, )8^, ~ N„ {Infio + Xa^d' ^'^n) (9) 
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with analogously transformed all-ones vector 1„ = '^^^In and design matrix = 
V^^^^Xd. Note that now also y and 1„ depend on the model d, but we suppress this 
dependence for ease of notation. 

We will now show how to use the hyper-^ prior (Liang et al., 2008) for the para- 
meters j5o, and in the decorrelated model (9). The hyper-^ prior consists of a 
locally uniform prior /(j6o) 1 on the intercept, Jeffreys' prior f{cr'^) oc {cr^)~^ on the 
regression variance and the ^-prior (Zellner, 1986) 

^,\gy^Ni(Oj,ga^{x]Xa)-') (10) 

on the linear coefficient vector, which is combined with a uniform hyper-prior on 
the shrinkage coefficient g/ (1 + g)- Note that the prior precision matrix in (10) is 
proportional to cr^-^X^X^ = a^'^X^^V^^X^, which is the Fisher information matrix of 
in model (8). Basically all formulae given by Liang et al. (2008) carry over to our 
setting, since inner products of the response vector y, the all-ones vector 1„ and the 
design matrix X^ in model (8) carry over to their transformed counterparts y, !„ and 
X^ in model (9). This is due to 

= (In + ZaDaZl)-' = I„ - Za{ZjZa + D^Y'^l (11) 

which follows from the matrix inversion lemma (see Henderson and Searle, 1981) and 
leads to 1„1„ = 1„1„ = n, 1„X^/ = l^X^ = Oj and l„y = l„y hy straightforward 
calculations. A most convenient property of the hyper-^ prior is that it yields a closed 
form marginal likelihood, which needs to be computed on the original response scale 
via the change of variables formula: 

f{y\d) « 

where y = n^^ IL'i=i Vii i^i is the Gaussian hypergeometric function (Abramowitz and Stegun, 
1964, p. 558) and is the classical coefficient of determination in model (8) (see Ap- 
pendix A.l for implementation details). 

Other hyper-priors could be assigned to g, but will typically not lead to a closed 
form of the marginal likelihood. Examples are the incomplete inverse-gamma prior 
on 1 + ^ (Cui and George, 2008, p. 891), which generalises the above uniform prior 
on g/ (1 + g), and an inverse-gamma prior on g, which corresponds to the Cauchy 
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{y - l„y) 



a +2) 



V 



1/2 



(12) 



prior of Zellner and Siow (1980). Liang et al. (2008) also describe a modified version 
of the hyper-g prior, called the hyper-g/ n prior, which is a special case of the conven- 
tional robust prior proposed by Forte (2011), for which a closed form of the marginal 
likelihood exists. 

Posterior inference in a given model d is based on Monte Carlo estimation of the 
parameters in model (7), using the factorisation 



/(i6o, iS„ Ud, cr\g\y)=f{ud\ ^o, )3„ a\ y)f{^o, jS, I o-\ g, y)f{a^ \ y)f{g \ y) . (13) 



Sampling of g, and subsequently j6o, can be done along the lines of Sabanes Bove and Held 
(2011b, section 2.3), by adapting their algorithm to the transformations in model (9). 
Finally, the spline coefficient vector is sampled from 



where L^j = (Z^Z^ + DJ^)^^ and disappears because Zjl„ = Qj^. 

Given posterior samples for the linear coefficient jSy and the spline coefficient vector 
Uj for covariate ; {dj > 1), we would like to transform these into samples for the 
function mj{xj), along a grid vector x* of n* points (on the same scale as the original xj 
used for the model fitting). This is in principle straightforward, but one has to carefully 
apply analogous transformations as in (4) and (5) to x* and the corresponding spline 
basis matrix Z*: 



Afterwards, for each coefficient sample one can compute the corresponding vector of 
function values mj{x^) = x*j}j + Z*jUj. Similarly, prediction samples for the corres- 
ponding response vector y* can be extracted from the sampling output. 




(14) 




(15) 



(16) 
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3. Hyper-^ Priors for Generalised Additive Models 



Now we extend the above setting and assume that the covariate effects nij[xj) enter 
additively into the linear predictor 

fj = l5o + 'tmj{xj) (17) 

7=1 

of an exponential family distribution with canonical parameter 6, mean E(y) = h{r]) = 
db{e)/de and variance Var(y) = (j)/iv ■ <fh{e)/de^ (see McCullagh and Nelder, 1989). 
We restrict our attention to non-normal distributions with fixed dispersion (p (as (p = \ 
for the Bernoulli and Poisson distribution) and known weight id. For n observations, 
the linear predictor vector tj = {rji, . . . , /y,,)^ is 

// = l„/5o + Xd)3^ + Zrfiirf (18) 

and the likelihood is 

fiy I /3o, u,) oc exp | £ | • (19) 

The main challenge for the derivation of a generalised g-prior is that the marginal 
density f{y \ fio, )8^), which results from integrating out the spline coefficient vector 

Ud-^NjKiOjK,Dd) (20) 

from (19), has no closed form. In particular, it is not Gaussian, in contrast to (8). 

Before addressing this problem we first consider appropriate construction of the 
design matrices and and calculation of the degrees of freedom dj{pj) for a 
smoothly modelled term nij. Starting with the latter, a reasonable generalisation of (6) 
is (see Ruppert et al., 2003, section 11.4) 

djipj) = tr{{ZjWZj + prhy'zJWZj} + 1, (21) 

which uses a fixed weight matrix W = W(l„/3o), where W(j/) = diag{(d//(/7,)/d?/)^/ Var(i/,)}"^ 
is the usual generalised linear model weight matrix and /3o is the intercept estimate 
from the null model d = Op. This definition avoids dependence of Pj{dj) on the model 
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d under consideration. As a consequence, we need to generalise the orthogonalisation 
of the original covariate vector Xj and spline basis matrix Zj from (4) and (5) to 



llWxj 

Xj = Xj -In „^ (22) 

llWZj xJWZj 
and Zj = Zj - 1„ ^ ' - Xj , (23) 

^ ^ \l^N\n 'xJWxj 

implying that 1,,, xj and the columns of Zj are orthogonal to each other with respect 
to the inner product in terms of W. This ensures that (21) correctly captures only the 
degrees of freedom associated with the nonlinear part of rrij. Note that (15) and (16) 
are adapted analogously. 

We will now derive a generalised g-prior analogous to (10) for the linear coefficient 
vector in the generalised additive model. The idea is to use the iterative weighted 
least squares (IWLS) approximation to (19) to obtain an approximate normal model of 
the form (7) and then derive the resulting g-prior (10). With a slight abuse of notation, 
e.g. h{t]) = {h{rii),...,h{y]n)Y, let 

zo = i]o + diag{dh{t]Q)/drj}-\y-h{t]Q)) (24) 

be the adjusted response vector resulting from a first-order approximation to h^^{y) 
around y = h{t]Q). Then 

zo I |6o, ~ N(l„/3o + Xd/S^ + ZdUd, W^') (25) 

with Wo = W(j/g) is the working normal model (see e.g. McCullagh and Nelder, 
1989, p. 40). Remember that the IWLS algorithm iteratively updates tj^ by weighted 
least squares estimation of the coefficients in (25). Here, we fix tj^ = 0„, which is the 
value expected a priori. Then we rewrite (25) using zq = Wq zq etc. as 

zo I i6o, ~ N(i„/3o + Xd^a + ZdUd, In), (26) 

which brings us back to a normal model of the form in (7). By computing the corres- 
ponding g-prior (10), we arrive at the generalised g-prior 

/SJg~NKOj,g7oi) (27) 
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with prior precision matrix proportional to 

Jo = + ^d^d^'d) ^^d 

= X]wy^{I„ + wy^ZaD,Z]wy^)-'wy^Xa. (28) 

An appealing feature of this prior is that it directly generalises the ^-prior proposed 
by Sabanes Bove and Held (2011a) for generalised linear models, to which it reduces 
when there are no spline effects in the model, /. e. Jq = X^WqX^. An alternative and 
more rigorous derivation of (28) as the Fisher information obtained from a Laplace 
approximation to the marginal model f{y \ j5o, )3^) is presented in Appendix B. 
The generalised hyper-^ prior 

f{h'^d'^d,g)= f{Mf{^Ag)f{g)f{u,) (29) 

is defined to comprise the locally uniform prior /(/5o) oc 1 on the intercept ^q, the gen- 
eralised ^-prior (27) on the linear coefficient vector the penalty prior (20) on the 
spline coefficient vector u^, and some proper hyper-prior f{g) on the hyper-parameter 
g. Posterior inference under this prior can be implemented by a straightforward exten- 
sion of the approach of Sabanes Bove and Held (2011a, section 3), which is outlined in 
the following. The efficient R-package "hypergsplines" for this and all other compu- 
tations in this paper is available from R-Forge.^ 

Let Xa = {ln,X^,Z^) and = (/3o,/S^, mJ)^ denote the grand design matrix and 
regression coefficient vector, respectively, such that t] = Xa^^. The prior for con- 
ditional on g has a Gaussian form with mean zero and singular precision matrix 
diag{0,g^^7o, DJ^}. Thus, the Gaussian approximation of /(/S^ | y,g,d), which is ne- 
cessary for the Laplace approximation of f{y \ g, d), can be obtained by the Bayesian 
IWLS algorithm (West, 1985). Afterwards, an approximation of the marginal likeli- 
hood of model d, 

00 

f{y\d) = j f{y\g,d)f{g)dg, (30) 



is obtained by numerical integration of the Laplace approximation f{y \g,d). Note 
that recently integrated Laplace approximations have successfully been applied in a 

^The website is http://r-forge.r-project.org/projects/hypergsplines. To install the R-package, 
just type install .packages ("hypergsplines" ,repos="http : //r-f orge .r-project . org") into R. 
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more general cor\text (Rue, Martino, and Chopin, 2009). Finally, we can use a tuning- 
free Metropolis-Hastings algorithm to sample from the joint posterior of and g in a 
specific model d. 



We propose a prior f{d) on the model space VP which explicitly corrects for the 
multiplicity of testing inherent in the simultaneous analysis of the p covariates (see 
Scott and Berger, 2010): A priori, the number of covariates included in the model (J) 
is uniformly distributed on {0,1,. . . ,p}. The choice of the J covariates is then uni- 
formly distributed on all possible configurations, and their degrees of freedom are 
independent and uniformly distributed on P \ {0} = {1,2,3, . . . ,X}. Altogether, this 
gives 



A nice property of this prior is that it leads to marginal prior probabilities P(dy = 0) = 
P(dy > 0) = 1/2. Elsewhere this is often achieved by assigning independent priors to 
the p covariates, which implies I ~ Bin(p, 1/2) and has the disadvantage that models 
with I K. p/2 included covariates are favoured by the prior. It is clear that our uniform 
prior on I allows the data y to have a maximum effect on the posterior of I because it 
is the reference prior (Bernardo, 1979). 

Alternatively, one might also use a fixed (independent of K) prior probability for a 
linear effect {dj = 1). This is appropriate for the situation where one explicitly wants to 
test linearity versus nonlinearity of each effect. Furthermore, a multiplicity correction 
for these tests can be implemented by assuming that the number of smoothly included 
covariates (/) is uniformly distributed on {0, 1, ... , /} and their choice is uniform on 
all possible choices. This would add one level to the prior hierarchy. 

As the model space T>P grows exponentially in the number of covariates p, only 
for small values of p all possible models can be evaluated. Otherwise the marginal 
likelihood f{y \ d) can be computed only for a subset of the model space. Usually 
this subset is determined by stochastic search procedures. Here we propose to use a 
simple Metropolis-Hastings algorithm with two possible move types in the proposal 
kernel: 



4. Model Prior and Stochastic Search 




(31) 
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Move Sample a covariate index / ~ U{1,2, ■ ■ ■ ,p} and decrease or increase dj to the 
next adjacent value in V (with probability 1/2 each, or deterministically if dj = 
or dj = K, respectively). 

Swap Sample a pair ~ U{(1, 1), (1,2), . . . ,{p,p)} of covariate indices (/ < /) and 
swap dj and dj. 

The 'Swap' move is designed to efficiently trace models with high posterior probability 
even in situations where covariates are almost collinear. For each Metropolis-Hastings 
iteration, a 'Move' is chosen with some fixed probability (we use 3/4), and otherwise 
a 'Swap'. Denote the current model by d, then the proposed model d' is accepted with 
probability 

'^^'^ '''^-'^ fiy\d)fidMd\d') 
where the calculation of the proposal probability ratio q{d' \ d)/q{d \ d') is straightfor- 
ward, see Appendix A.2. 

The advantage of such an MCMC based model exploration compared to more elab- 
orate stochastic search algorithms (e. g. Hans, Dobra, and West, 2007; Clyde, Ghosh, and Littman, 
2011) is that it does not preclude estimation of posterior probabilities via sampling fre- 
quencies, as it was originally proposed for MCMC model composition by Madigan and York 
(1995). Recently reported problems with renormalized probability estimates (Clyde and Ghosh, 
2010; Garcia-Donato and Martinez-Beneito, 2011) can be avoided by using the model 
sampling frequencies instead. Nevertheless, other search procedures might be be- 
neficial when only the maximum a -posteriori (MAP) model and not e.g. the marginal 
posterior inclusion probabilities for the covariates are of interest. 

5. Applications 

We illustrate the use of the hyper-^ prior for additive modelling in Section 5.1 and for 
logistic regression in Section 5.2. 

5.1 . Ozone Data 

We apply our additive modelling approach to the ozone data from Breiman and Friedman 
(1985) on the association between p = 9 meteorological covariates and the maximum 
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one-hour average ozone concentration for n = 330 days in 1976 (see Table 1 for details). 
As the computational complexity of the marginal likelihood (12) is cubic in the spline 
basis dimension K (see Appendix A.l), we want to use splines with few, quantile- 
based knots. Therefore, we choose cubic O'Sullivan splines (Wand and Ormerod, 
2008). Here, we get basis matrices Zy with K = 6 columns from 4 inner knots at 
the quintiles. 



Variable 


Description 


y 


Maximum 1-hour average ozone level [ppm] 


X\ 


Day of the year 


Xl 


500 millibar pressure height [m] 


X3 


Wind speed [mph] 


X4 


Relative humidity [%] 


X5 


Temperature at Sandberg, CA [°F] 


Xe 


Inversion base height [feet] 




Pressure gradient [mm Hg] 


Xg 


Visibility [miles] 


X9 


Inversion base temperature [°F] 



Table 1 - Description of the variables in the ozone data set. 



Exhaustive evaluation of the posterior model probabilities f{d \ y) cx f[y | d)f{d) 
of all {K + lf = 40353607 models takes 531 minutes on a standard 2.8 GHz CPU. 
We also applied the stochastic search algorithm described in Section 4 with 10^ iter- 
ations, which took 2 minutes and resulted in 125 619 models. 92.4% of the posterior 
model probability have been discovered and the 1858 top models were found by this 
algorithm. 

In Table 2 the marginal posterior probabilities for linear and smooth inclusion of 
the nine covariates are shown. While xi, x^, xy and xg are clearly included as smooth 
terms, there is considerable uncertainty for the other covariates whether to be included 
linearly or smoothly. Only the overall inclusion probability for Xg is below 50%. 

The MAP model includes smooth terms for Xi, X5, X/, xg and Xg. The covariates 
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X2 


X3 


X4 


X5 


X6 


x? 


^8 


X9 


not included {dj = 0) 


0.00 


0.02 


0.33 


0.06 


0.00 


0.59 


0.00 


0.00 


0.28 


linear {dj = 1) 


0.00 


0.46 


0.32 


0.35 


0.00 


0.14 


0.00 


0.05 


0.22 


smooth {dj > 1) 


1.00 


0.52 


0.35 


0.59 


0.99 


0.28 


1.00 


0.95 


0.50 



Table 2 - Marginal posterior inclusion probabilities in the ozone data set. 



X2 and Xi are included linearly while X3 and are not included. Figure 1 shows 
the estimated covariate effects, which were obtained from 10000 posterior samples. 
Note that for linear functions mj, the pointwise credible intervals coincide with the 
simultaneous credible intervals (Besag, Green, Higdon, and Mengersen, 1995, p. 30). 
This is because all straight lines samples intersect in one point, which is {xj, 0) due to 
the centring of the covariates in (4). 

In comparison to the MAP model in Sabanes Bove and Held (2011b, section 4) based 
on Bayesian fractional polynomials, similar functional forms are estimated for the ef- 
fects of Xi, X4 and X5, while differences are visible for xy and xg. Note that xg is in- 
cluded in the MAP model in Sabanes Bove and Held (2011b). See also Casella and Moreno 
(2006) for an objective Bayesian variable selection analysis (without the possibility of 
smooth effects) of this data set. 

5.2. Pima Indian Diabetes Data 

We now apply the generalised additive modelling approach to the logistic regression 
of p = 7 potential risk factors on the presence of diabetes in n = 532 women of Pima 
Indian heritage (Frank and Asuncion, 2010; Ripley, 1996), see Table 3 for details. As 
for the ozone data set, we use cubic O'Sullivan splines with inner knots at the quintiles 
and a uniform hyper-prior on g/ + 1), and explore the model space of dimension 
7'' = 823 543 with 10^ iterations of the stochastic search algorithm. The computational 
complexity is higher than for the normal response case, with 218 minutes required for 
the evaluation of 43 766 models. We validated the results with an exhaustive evaluation 
of all models, requiring 94 hours. Indeed, the stochastic search found 98.3% of the 
posterior probability mass and the 489 top models. 
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Figure 1 - Estimated covariate effects in the MAP model for the ozone data, based on 10000 samples: 
Posterior means (solid lines), pointwise (dashed lines) and simultaneous (dotted lines) 95%- 
credible intervals are shown. 
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Variable 


Description 


y 


Signs of diabetes according to WHO criteria (Yes = 1, No = 0) 


X\ 


Number of pregnancies 


X2 


Plasma glucose concentration in an oral glucose tolerance test [mg/ dl] 


X3 


Diastolic blood pressure [mm Hg] 


X4 


Triceps skin fold thickness [mm] 


X5 


Body mass index (BMI) [kg/m^] 


xe 


Diabetes pedigree function 


X7 


Age [years] 



Table 3 - Description of the variables in the Pima Indian diabetes data set. 





Xl 


Xl 


X3 


X4 


X5 


X6 


x? 


not included {dj = 0) 


0.63 


0.00 


0.81 


0.84 


0.00 


0.02 


0.01 


linear (dj = 1) 


0.09 


0.48 


0.09 


0.06 


0.11 


0.26 


0.00 


smooth [dj > 1) 


0.28 


0.52 


0.09 


0.10 


0.88 


0.72 


0.99 



Table 4 - Marginal posterior inclusion probabilities in the Pima Indian diabetes data set. 



In Table 4 the marginal posterior probabilities for linear and smooth inclusion of 
the covariates are shown. There is clear evidence for inclusion of the covariates X2, X5, 
Xg and X7, which have posterior inclusion probabilities over 98%. For the other three 
covariates, the inclusion probability is below 40%. Smooth modelling of the effects of 
X5, Xg and X7 seems to be necessary, while this is not so clear for X2. 

Figure 2 shows the estimated covariate effects in the MAP model which features a 
linear term for X2 and smooth terms for X5, Xe and X7. The estimates are obtained from 
10000 MCMC samples.^ Note that the Chib and Jeliazkov (2001) estimate (-243.426, 
MCMC standard error 0.008) of the log marginal likelihood of the MAP model, which 
was also computed, is quite close to the integrated Laplace approximation (—243.511). 

^Every 2nd sample was saved after burning the first 1000 iterations, with acceptance rate 67% using two 
IWLS steps per proposal. 
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Figure 2 - Estimated covariate ejfects in the MAP model for the Pima Indian diabetes data set, based 
on 10000 MCMC samples: Posterior means (solid lines), pointwise (dashed lines) and sim- 
ultaneous (dotted lines) 95%-credible intervals are shown. 

This indicates that the integrated Laplace approximation is fairly accurate. 

The results are qualitatively similar to those obtained with a fractional polynomial 
modelling approach by Sabanes Bove and Held (2011a, section 5) and with a cubic 
smoothing spline approach by Cottet et al. (2008, section 3.2). It is interesting that in 
the earlier work by Yau et al. (2003, section 5.2), a very low posterior inclusion prob- 
ability (0.07) for was reported for a different subset of the original Pima Indian 
diabetes data set. If pure variable selection without covariate transformation is con- 
sidered, as in Holmes and Held (2006, section 2.6) and Sabanes Bove and Held (2011a, 
section 4), the strong nonlinear effect of is missed completely, and instead x-[ gets 
a higher posterior inclusion probability. This highlights the importance of allowing 
nonlinear covariate effects. 
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6. Postprocessing 



Given the list of all possible models d G VP, or a subset found by the stochastic search 
procedure described in Section 4, one may consider postprocessing the results. 

First, when the main interest lies in variable selection, the models which feature the 
same covariates can be summarised into a meta-model: The posterior probabilities of 
the sub-models are summed up to give the posterior probability of the meta-model, 
and estimates in the meta-model are obtained by averaging the sub-models with 
weights proportional to their posterior probabilities (see e. g. Hoeting, Madigan, Raftery, and Volinsky, 
1999, for model averaging). For example, the best meta-model for the ozone data fea- 
tures all covariates except and has posterior probability 0.261. The corresponding 
estimates of the covariate effects are shown in Figure 3. For the Pima Indian diabetes 
data the meta-model with highest posterior probability (0.466) includes X2, X5, and 
xy. In both examples, the best meta-model happens to be identical with the median 
probability meta-model, which features all covariates having marginal posterior in- 
clusion probability greater than 50% (Barbieri and Berger, 2004), cp. Tables 2 and 4. 
Similarly, it could be interesting to summarise models which only differ in the de- 
grees of freedom for smooth terms. This would correspond to the situation of testing 
linearity versus nonlinearity of covariate effects (cp. Section 4). 

Second, in order to allow for continuous degrees of freedom, one can optimise the 
marginal likelihood of the MAP model with respect to the degrees of freedom of the 
covariates included. That is, an optimisation of f{y \ d) over the continuous range 
{1,K + 1) is performed for all dj's with > in the MAP model. For example, 
the MAP configuration for the ozone data is (4,1,0,1,3,0,4,2,2) and the resulting 
optimised configuration is (4.35,1,0,1.08,3.44,0,3.63,2.3,1), which increases the log 
marginal likelihood from —1413.86 to —1412.91. Although rfg decreases from 2 to 1 
(rounded down to 2 decimals), the function estimates are very similar to those from 
the MAP model in Figure 1. For the Pima Indian diabetes data, the log marginal 
likelihood increases from -243.51 for the MAP model (0,1,0,0,3,2,4) to -243.39 for 
the optimised configuration (0, 1, 0, 0, 3.36, 2.09, 3.73). Again the function estimates are 
omitted because they are close to those from the MAP model in Figure 2. 
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Figure 3 - Estimated covariate effects in the best meta-model (and median probabilih/ meta-model) for 
the ozone data, based on 20 000 samples: Posterior means (solid lines), pointwise (dashed 
lines) and simultaneous (dotted lines) 95%-credible intervals are shown. 
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7. Discussion 



Our Bayesian approach to simultaneous variable and function selection in gener- 
alised regression is based on fixed-dimension spline bases and penalty-parameter 
smoothness control. In this respect, it differs from knot-selection approaches such as 
Smith and Kohn (1996) and Denison, Mallick, and Smith (1998). We are only consid- 
ering penalties on a fixed grid of values, which scales automatically for each covariate 
via the degrees of freedom transformation. In this regard, our approach is close to 
many popular Lasso-type proposals, which optimise the tuning-parameters on a fixed 
grid via cross-validation (e. g. Zou and Hastie, 2005). Cantoni and Hastie (2002) pro- 
pose a likelihood-ratio-type test statistic to compare additive models with different 
degrees of freedom. Fong, Rue, and Wakefield (2010) use a similar scaling to examine 
the prior on the degrees of freedom implied by the prior on the variance component 
in a generalised linear mixed model. They also use O'Sullivan spline bases as we did 
in our examples, but they do not consider variable selection. 

In a frequentist setting, Marra and Wood (2011, section 2.1) propose to use an addi- 
tional penalty on the linear part of the spline function in order to shrink it adaptively 
to zero. To include variable selection, a lower threshold for the effective degrees of 
freedom must be chosen. Our generalised g-prior (27) also shrinks the linear parts of 
the spline functions to zero, where the prior covariance matrix takes the correlations 
between the covariates into account. Furthermore, we explicitly ex- or include covari- 
ates and then compare the resulting models based on their posterior probabilities. 

We propose a conventional prior for the intercept and the linear coefficients, which 
directly generalises the hyper-^ prior in the linear model (Liang et al., 2008) and in the 
generalised linear model (Sabanes Bove and Held, 2011a). Pauler (1998) proposes a 
related unit-information prior for the fixed effects in linear mixed models, but fixes g = 
n in (10). Overstall and Forster (2010) propose a unit-information prior for the fixed 
effects in generalised linear mixed models, but the information matrix is based on the 
first-stage likelihood and not on the integrated likelihood as in our approach. Also, no 
hyper-prior on the parameter g is considered, because it is fixed at ^ = n. As they use 
an inverse-Wishart prior on the covariance matrix of the random effects, their approach 
is perhaps better suited to generic random effects models. Forster, Gill, and Overstall 
(2010) propose a novel reversible-jump MCMC algorithm to infer posterior model 
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probabilities. 

In future work, we would like to combine the semiparametric splines with the 
parametric fractional polynomials (Sabanes Bove and Held, 2011b). The idea is that 
a smooth term mj{xj) could also be modelled by a fractional polynomial instead of a 
spline. This extension could be implemented coherently, because the prior formula- 
tions are compatible. With such a general framework, the important question whether 
a parsimonious fractional polynomial {e.g. my = xyjiyi + Xyfiy2 in the Pima Indian 
diabetes data example) is sufficient could be answered via posterior probabilities (see 
Strasak, Umlauf, Pfeiffer, and Lang (2011) for a simulation study comparing the step- 
wise fractional polynomial approach by Royston and Sauerbrei (2008) with penalised 
spline approaches). 

A. Implementation details 

Section A.l gives details on the computation of the marginal likelihood (12) for nor- 
mal additive models. Section A.l derives the proposal probabilities for the stochastic 
search described in Section 4. 

A.1. Marginal likelihood computation 

A Cholesky factorisation of the covariance matrix G ]R"^" has complexity C(n'^) 
and is therefore computationally expensive. Therefore, it is advisable to avoid it and 
instead work with the formula 

Vf = I„-Z,M,'zj 

for the precision matrix, where = ZjZ^j + D^^. The latter matrix has dimension 
}K, which is usually smaller than n, provided the spline basis dimension K is small. 
Thus, the Cholesky factorisation = M^J^M^J^ is relatively fast, and we compute 

For the coefficient of determination R^ = SSM^/ SST^j, we need to compute the sum 
of squares in total (SST^j) and the sum of squares explained by the model (SSM^). For 
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SST^, we have 



SSTa = {y-l„yfv/{y-lny) 



|W^(t/-l„y)f. 



-(n-l) 



ssr 



-{n-\)ll 



Note that the first term in (12) can be written as (1/ — In]/) 

For SSMd, note that the fit of the general linear model is y^ = l„y + X^^^, where 

is the weighted least squares estimate of Therefore 

SSM^ = {y^-inyYv-,\y,-iny) 

can be computed by Cholesky factorising X^V^^X^ = C^Cd, solving the triangular 
system C^v^ = X^V^^y and setting SSM^ = H^'dH^- 

Finally, to compute the determinant term in (12), we can again avoid factorising V^, 
because we have 



1/2 



1/2 



In-WdWl = IjK-WlWd 



1/2 



see Harville (1997, p. 416) for the last equation. So again only a matrix of dimension 
]K, namely Ijk — Wjw^, needs to be factorised. Here, a LU factorisation can be used. 



A.2. Proposal probabilities 

First note that the two proposal types 'Move' and 'Swap' do not overlap, because a 
'Move' always changes exactly one dj, while a 'Swap' either changes none or two dj's. 
Denote with p,„ the probability to choose a 'Move'. 

Suppose a 'Move' was proposed for covariate ; G {0, 1, ... , p}. We then have 



and analogously 



q{d' \d) = pn 



q{d I d') = p„ 



1, d^G{0,K}, 
i, else 



1, d'.e{0,K}, 
I, else 
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with proposal ratio 



q{d'\d) 
Cj{d I d') 



1, else. 



For the 'Swap' proposal, suppose covariates i and ; are proposed to interchange 
their model parameters dj and dj. Of course, if d, = dj, then the proposal ratio equals 
unity because d' = d. In the other case, both model parameters are changed, and 

q{d'\d)=q{d\d') = {l-p„r)-(f 



so that for a 'Swap' we always have q{d' \ d) / q{d \ d') = 1. 



B. Approximate Fisher Information for non-normal response 

In this section, we present a formal derivation of (28) as the approximate Fisher in- 
formation obtained from a Laplace approximation to f{y \ j6o, For ease of notation 
we restrict the presentation to canonical response functions where rj = 6 and omit 
subscripts where they are not necessary for understanding. With O = diag{^/ Wi}"^^, 
we can then rewrite the likelihood (19) as 

/(i/|i6o,)8,M) cxexp{/a)-iv-l^<J>"^Kv)}- (32) 

We will now use the Laplace approximation to integrate (32) over u with respect to the 
prior u ~ N(0, D). The Laplace approximation has proven to be useful in the general- 
ised linear mixed model framework, see e. g. Breslow and Clayton (1993). Specifically, 
the use of the Laplace approximation for penalised spline smoothing has been invest- 
igated by Kauermann, Krivobokova, and Fahrmeir (2009), who prove its asymptotic 
validity. 

For the Laplace approximation of f{y \ /3o,/3) we first need to maximise the unnor- 
malised log posterior of u, 

/(h) =log/(i/|/3o,i8,M)+log/(M) 

= y'^^^rj - l^a>"^&(j/) - -u^D^u + const, (33) 
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where /5o and ^ in tj = IjSo + Xj8 + Zu are considered to be fixed. The corresponding 
score vector is 

d 



du 



l{u) = Z^^^y - Z^diag{b'(v)}a)-h - D^u 
= Z'^^-^{y -}i) -D-^u, 



where }i = b'(t]), and the corresponding Hessian is 

= -Z'^W{t])Z- D-^. 

Making one Newton-Raphson step from the starting point u = 0, we get the approx- 
imate mode u* of l{u): 

-1 J 







d d 
du du^ 



m 



du 

-1 



m 



Z^W{r]^)Z + D-'J Z^^-Hy-fi^), (34) 

where tj^ = + and = Note that this corresponds to the result of a 

second-order Taylor expansion of Z(m) around u = 0. Hence, the Laplace approxima- 
tion of f{y I /3o, j8) is 

-1/2 



/(i/|/3o,iS)cxexp(Z(M*))(27r)/^/2 



d d 



du du 



jl{u*) 



exp ( y^^^t]* - l^^-^b{ti*) - -u*^D-^u* 



X (27r)^^/2 z'^W{tj*)Z + D 



-1 



-1/2 



(35) 



where }K is the dimension of u. 

In order to derive the approximate Fisher information of ^ from f{y\fio,^), we 
make two additional simplifying assumptions: First, we assume that W{t]) does not 
vary much in )3, so that we can ignore the determinant in (35), for example. This is 
a common simplification, suggested e. g. in Breslow and Clayton (1993). Second, we 
approximate b{t]*) by a second-order Taylor expansion of b{t]) around j/^, yielding 

l^<^-^b{ti*) ^ l^<^-^b{ri^) + fil<^-^Zu* + ^u*^Z'^WiZu*, 
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where Wi = W{tj^). Using these two simplifications and plugging in (34), we arrive 
at the expression 

log/(i/ 1 /5o,)3) = y^^^tj^ - 

+ 1(2/ - ji^)^<^-^Z{Z''Wi^Z + D-i)-iZ^O-i(i/ - (36) 

for the approximate marginal log-likelihood of /3o and /3. From (36) we can finally 
approximate the Fisher information /(/3o, /S) = log/(!/ 1 /^Oz/S) as 

= x'^wY^ (^i - wY^z{z'^WiZ + D-'^y'^z'^wY^^ wY^x (37) 

= X^wY^il + wY^ZDZ'^wY^y^wY^X. (38) 

Evaluating the approximate Fisher information at /Sq = 0, )3 = 0, such that Wi = 
W(0), we recognise that /(0, 0) from (38) is identical to /g in (28). Note that the repres- 
entation (37) can be better suited for computation: the first paragraph of Appendix A.l 
applies here after replacing Z^j with Wq Z. 
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